Among their many other roles, proteins are the scaffolds, workhorses, and computational devices of all organisms. For many practical purposes, a protein is a string of characters over a 20-letter alphabet. These characters represent amino acids. A protein's 3D structure and biological function depend on its sequence of amino acids.
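The string view of a protein can be made concrete with a minimal sketch. The one-letter codes below are the 20 standard amino acids; the function name and the example fragment are hypothetical and chosen only for illustration.

```python
# The 20 standard amino acids, written with their one-letter codes.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_protein_sequence(seq: str) -> bool:
    """Return True if seq is non-empty and uses only the 20 standard letters."""
    return bool(seq) and set(seq.upper()) <= AMINO_ACIDS


# A made-up fragment, used only to exercise the check.
fragment = "MKTAYIAKQR"
valid = is_protein_sequence(fragment)
```

Real sequence databases also use a few extra letters (for example X for an unknown residue), which this sketch deliberately rejects.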
Proteins are typically composed of several functional subunits, called domains, each of which functions relatively autonomously. Domains are shuffled through evolution in a mix-and-match process in which new proteins are created as combinations of existing domains.
Recent technological advances and genome sequencing projects have given us a very large number of protein sequences to analyze. However, our knowledge of the higher properties of proteins, such as their shape and function, is scarce, since such information is much harder to derive experimentally.
We are still far from being able to deduce a protein's structure or function from its sequence. We approach this problem using homology modeling: the basic idea is to infer a protein's higher properties from those of other proteins that have similar sequences. Because of the mix-and-match process mentioned earlier, we believe it is best to apply homology modeling to the domains of a protein. However, even parsing a protein sequence into its domains is still an unsolved problem.
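The idea of inferring properties from similar sequences can be sketched as a nearest-neighbor lookup. This is only an illustration under stated assumptions: Python's difflib stands in for a proper sequence alignment score, and the function names, the 0.8 threshold, the sequences, and the labels are all hypothetical rather than taken from our process.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def transfer_annotation(query, annotated, threshold=0.8):
    """Return the label of the most similar annotated sequence, or None.

    `annotated` maps sequences to labels. No label is transferred when the
    best similarity falls below `threshold` (an assumed cutoff).
    """
    best_label, best_score = None, 0.0
    for seq, label in annotated.items():
        score = similarity(query, seq)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Made-up sequences and labels, for illustration only.
known = {
    "MKTAYIAKQRQISFVK": "family A",
    "GSHMLEDPVDAFKWW": "family B",
}
guess = transfer_annotation("MKTAYIAKQRQISFVR", known)
```

In practice the similarity would come from an alignment tool, and the cutoff would be set from statistical significance rather than a fixed ratio.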
I will present a process we have developed for identifying and classifying protein domains in a comprehensive database of protein sequences. Our process combines sequence similarity identification, graph-based clustering, machine learning, statistical modeling, and iterative refinement. We achieve state-of-the-art results, recovering 63% of the known domain families and suggesting new families with about 40% fidelity.
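To give a flavor of the graph-based clustering step, here is a minimal sketch that treats sequences as nodes, links pairs whose similarity score passes a threshold, and reports the connected components as candidate families. Connected components (single linkage) is an assumption made for brevity and is not necessarily the clustering we use; the function name, the identifiers, the scores, and the threshold are hypothetical.

```python
def cluster_by_similarity(items, scored_pairs, threshold):
    """Group items into connected components of the similarity graph.

    `scored_pairs` is an iterable of (a, b, score) triples; an edge is kept
    only when score >= threshold. Uses a simple union-find structure.
    """
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(members) for members in clusters.values())


# Made-up segment identifiers and similarity scores.
segments = ["d1", "d2", "d3", "d4"]
scores = [("d1", "d2", 0.9), ("d2", "d3", 0.85), ("d3", "d4", 0.2)]
families = cluster_by_similarity(segments, scores, threshold=0.5)
```

Single linkage tends to chain unrelated groups together through a few spurious edges, which is one reason a pipeline like ours would follow clustering with statistical modeling and iterative refinement.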
This is joint work with Michal Linial and Nati Linial.